Multiple Alignment with Hidden Markov Models
Abstract
Multiple alignment of sequences is an important problem in bioinformatics. For example, a multiple alignment of proteins belonging to the same family can provide valuable information about the common protein structure. A multiple alignment can reveal patterns that are not obvious from pairwise alignments between members of the family alone. Traditional alignment schemes first choose a scoring matrix and gap penalties and then use dynamic programming, which gives an exact result in O(N^2) operations for two sequences of length N. For multiple alignment of K sequences, dynamic programming requires O(N^K) operations, which is exponential in K, so the algorithm is not feasible for more than a few sequences. Protein families, on the other hand, often contain hundreds of sequences. Various heuristic algorithms have been introduced to avoid this problem. However, the solution to a multiple alignment problem often depends strongly on the scoring scheme, which raises the question of how to choose the most biologically relevant one. Another important issue is that scoring matrices and gap penalties are assumed to be independent of position along the sequence, whereas gene families often exhibit both highly conserved and highly variable regions. To obtain the most biologically relevant multiple alignment, these differences should be taken into account. If multiple alignments are to be constructed from primary structure information alone, the position-dependent variations in scoring matrices and gap penalties must be deduced from the sequences themselves. Hidden Markov Models (HMMs) implement the idea that the scoring parameters should guide the multiple alignment as much as the alignment should determine the scoring parameters. A Hidden Markov Model, λ, is a stochastic machine for generating (amino acid) sequences. Different sequences are generated with different probabilities.
A sequence S is generated with probability P(S|λ). The goal is, given a universe Λ of HMMs, to choose the member λ* that maximizes ∏_{k=1}^{K} P(S_k|λ), where S_1, ..., S_K are the sequences to be aligned. The procedure of choosing λ* can be thought of as training an HMM on the training set {S_1, ..., S_K}, analogous to training a neural network. The hope is that, after training, the model has captured the essence of “what it means to be a member of this particular protein family”. If so, λ* will preferentially generate amino acid sequences that belong to the family, i.e. sequences that exhibit the same
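The selection criterion in the abstract, choosing the model that maximizes the product of P(S_k|λ) over the training sequences, can be illustrated with a minimal sketch. It assumes a plain discrete HMM with no silent states or end state (not the profile architecture typically used for alignment), computes P(S|λ) with the standard forward algorithm, and picks the best model from a small, hand-written candidate set rather than training over a full universe Λ. The function names, the `(start, trans, emit)` tuple layout, and the toy models are hypothetical and are not taken from the paper.

```python
import math


def sequence_likelihood(seq, start, trans, emit):
    """Forward algorithm: P(seq | model) for a discrete HMM.

    start[s]    : probability of starting in state s
    trans[s][t] : probability of moving from state s to state t
    emit[s][x]  : probability that state s emits symbol x
    Assumes seq is non-empty and the model has no silent states.
    """
    states = list(start)
    # alpha[s] = P(symbols seen so far, current state is s)
    alpha = {s: start[s] * emit[s].get(seq[0], 0.0) for s in states}
    for symbol in seq[1:]:
        alpha = {
            t: emit[t].get(symbol, 0.0)
            * sum(alpha[s] * trans[s][t] for s in states)
            for t in states
        }
    return sum(alpha.values())


def total_log_likelihood(seqs, model):
    """log of prod_k P(S_k | model), computed as a sum of logs."""
    total = 0.0
    for seq in seqs:
        p = sequence_likelihood(seq, *model)
        if p == 0.0:
            return float("-inf")  # the model cannot generate this sequence
        total += math.log(p)
    return total


def choose_model(seqs, models):
    """Return the name of the candidate model with the highest likelihood."""
    return max(models, key=lambda name: total_log_likelihood(seqs, models[name]))


# Hypothetical one-state candidates over a two-letter alphabet.
one_state = ({"s": 1.0}, {"s": {"s": 1.0}})
candidates = {
    "a_rich": (*one_state, {"s": {"A": 0.8, "C": 0.2}}),
    "c_rich": (*one_state, {"s": {"A": 0.2, "C": 0.8}}),
}
family = ["AAC", "AAA", "ACA"]
best = choose_model(family, candidates)
```

Summing log-probabilities instead of multiplying raw probabilities avoids numerical underflow when K is large. In practice the maximization is usually done by iterative parameter estimation rather than by enumerating candidates; the enumeration here is only meant to make the objective concrete.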
Related resources
A generalization of Profile Hidden Markov Model (PHMM) using one-by-one dependency between sequences
The Profile Hidden Markov Model (PHMM) can be poor at capturing dependency between observations because of the statistical assumptions it makes. To overcome this limitation, the dependency between residues in a multiple sequence alignment (MSA) which is the representative of a PHMM can be combined with the PHMM. Based on the fact that sequences appearing in the final MSA are written based on th...
Propositionalisation of Multiple Sequence Alignments using Probabilistic Models
Multiple sequence alignments play a central role in Bioinformatics. Most alignment representations are designed to facilitate knowledge extraction by human experts. Additionally statistical models like Profile Hidden Markov Models are used as representations. They offer the advantage to provide sound, probabilistic scores. The basic idea we present in this paper is to use the structure of a Pro...
Multiple Word Alignment with Profile Hidden Markov Models
Profile hidden Markov models (Profile HMMs) are specific types of hidden Markov models used in biological sequence analysis. We propose the use of Profile HMMs for word-related tasks. We test their applicability to the tasks of multiple cognate alignment and cognate set matching, and find that they work well in general for both tasks. On the latter task, the Profile HMM method outperforms avera...
Sequence Database Search Using Jumping Alignments
We describe a new algorithm for amino acid sequence classification and the detection of remote homologues. The rationale is to exploit both vertical and horizontal information of a multiple alignment in a well balanced manner. This is in contrast to established methods like profiles and hidden Markov models which focus on vertical information as they model the columns of the alignment independe...
Bayesian Restoration of a Hidden Markov Chain with Applications to DNA Sequencing
Hidden Markov models (HMMs) are a class of stochastic models that have proven to be powerful tools for the analysis of molecular sequence data. A hidden Markov model can be viewed as a black box that generates sequences of observations. The unobservable internal state of the box is stochastic and is determined by a finite state Markov chain. The observable output is stochastic with distribution...
Applications of hidden Markov models for comparative gene structure prediction
Identifying the structure in genome sequences is one of the principal challenges in modern molecular biology, and comparative genomics offers a powerful tool. In this paper we introduce a hidden Markov model that allows a comparative analysis of multiple sequences related by a phylogenetic tree. The model integrates structure prediction methods for one sequence, statistical multiple alignment m...
Publication date: 2001